MiniMax-M3 MXFP8 full sweep config for GB300 by Oseltamivir · Pull Request #1735 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-13T01:06:19Z

Summary

Add minimaxm3-fp8-gb300-dynamo-vllm to nvidia-master.yaml with 7 topologies: TP4, TP8, TP4+EP4, 1P+1D disagg (2-node), 1P+1D collocated (1-node), DEP4, DEP8
All GB300 recipes include sbatch_directives: mem: "0" / cpus-per-task: "72" plus srun_options: mem: "0" (CW DefMemPerCPU=4096 cgroup fix — step-level mem=0 alone only grants what the job allocation already holds) and omit safetensors prefetch (host-memory limit)
All recipe YAMLs included under minimax-m3-gb300-fp8/{1k1k,8k1k}/
Concurrency sweep: TP 4-64, TEP 128-512, disagg 64-512, DEP4 256-1024, DEP8 512-2048

Test plan

GB300 disagg canary passed: run 27449360195 (landed on gb300-nv, so it did not exercise the CW cgroup cap)
CW cgroup fix (sbatch_directives mem=0): full-sweep run 27452273567 OOMed on gb300-cw before the fix; needs a re-run that lands on gb300-cw
CW cgroup fix verified: run 27452976271 cleared the OOM on gb300-cw (generated sbatch shows --mem=0 --cpus-per-task=72)
HF cache lock-race fix (HF_HUB_OFFLINE=1 in worker env): run 27452976271 then failed because both TP8-2n nodes raced on the shared HF blob .lock (Lock acquisition failed); workers now run cache-only against the launcher-pre-staged snapshot — needs a re-run on gb300-cw to confirm
Full sweep dispatched: TBD after merge

Note

Medium Risk
Touches multinode Slurm launch paths, shared NFS model cache, and a runtime in-place patch of installed vLLM code on MI300X; mis-staging or patch drift could fail jobs or skew benchmark validity, but scope is benchmark infra rather than production serving.

Overview
Adds MiniMax-M3 MXFP8 multinode benchmarking on GB300 (CoreWeave) and extends MI300X with EAGLE3 speculative decoding, plus CI/launcher fixes for large Hub downloads.

GB300: New minimaxm3-fp8-gb300-dynamo-vllm in nvidia-master.yaml drives disaggregated Dynamo+vLLM sweeps (1P1D TP4+EP4 and rack-scale 5P12D / 10P7D shapes for 1k1k and 8k1k). Four new srt-slurm recipes under minimax-m3-gb300-fp8/ configure NixlConnector KV transfer over multi-node NVLink (UCX_CUDA_IPC_ENABLE_MNNVL, enable-cumem-allocator), Slurm mem=0 / cpus-per-task=72, omit safetensors prefetch, and set HF_HUB_OFFLINE=1 on workers. launch_gb300-cw.sh wires minimaxm3/fp8 to those recipes, pre-stages the ~444 GB snapshot on shared NFS (/mnt/vast/hf-home), and clears Hub .lock races before srtctl apply.

MI300X: New minimaxm3-fp8-mi300x-vllm-mtp config and minimaxm3_fp8_mi300x_mtp.sh serve with EAGLE3 (Inferact/MiniMax-M3-EAGLE3, 3 tokens) and apply an in-place vLLM patch for missing SupportsEagle3 on the ROCm AMD model. launch_mi300x-amds.sh routes spec-decoding: mtp via _mtp script suffix. The base minimaxm3_fp8_mi300x.sh drops --enforce-eager and sets VLLM_USE_BREAKABLE_CUDAGRAPH=0 for CUDA graphs.

CI: benchmark-multinode-tmpl.yml exports HF_TOKEN so Slurm workers can pull day-zero Hub models without 429s.

Reviewed by Cursor Bugbot for commit 805dc1c.

Add minimaxm3-fp8-gb300-dynamo-vllm to nvidia-master.yaml with 7 topologies covering the full concurrency range:
- TP4/TP8 (low latency, conc 4-64)
- TP4+EP4 agg + 1P+1D disagg 2-node + 1P+1D collocated (mid, conc 64-512)
- DEP4/DEP8 (high throughput, conc 256-2048)

All recipe YAMLs included under minimax-m3-gb300-fp8/{1k1k,8k1k}/. GB300 recipes include srun_options mem=0 (CW DefMemPerCPU cgroup fix) and omit safetensors-load-strategy prefetch (host-memory limit).

github-actions · 2026-06-13T01:06:27Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-06-13T01:19:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452223695
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27452223695

github-actions · 2026-06-13T01:35:06Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452273567
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27452273567

srun_options.mem=0 only grants a step the job's existing allocation; on gb300-cw (DefMemPerCPU=4096, no DefCpuPerGPU) the job itself was only allocated 4 GB/node and workers were cgroup-OOM-killed during engine init (run 27452273567: oom_kill in StepId=7409.7 on slurm-gb300-133-193, worker RLIMIT showed 4194304 KB). The canary passed because it landed on gb300-nv, which doesn't enforce the cap. Mirrors the sbatch_directives block of the DSV4 agentic recipes.
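
The 4 GB cap is consistent with Slurm's default-memory arithmetic. A minimal sketch, assuming the job was allocated a single CPU per node when neither --mem nor --cpus-per-task was set (the function name and the single-CPU assumption are illustrative and not taken from the cluster config):

```python
def default_job_mem_kb(def_mem_per_cpu_mb: int, cpus_allocated: int) -> int:
    """Per-node memory granted when --mem is unset: DefMemPerCPU x allocated CPUs."""
    return def_mem_per_cpu_mb * cpus_allocated * 1024  # MB -> KB


# gb300-cw: DefMemPerCPU=4096. ASSUMPTION: 1 CPU allocated per node before the fix.
limit_kb = default_job_mem_kb(def_mem_per_cpu_mb=4096, cpus_allocated=1)
print(limit_kb)  # 4194304, matching the RLIMIT the workers reported
```

With --mem=0 at the sbatch level Slurm grants all memory on each node instead, so the per-CPU default no longer sets the limit.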

github-actions · 2026-06-13T01:54:37Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452976271
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27452976271

…h_model lock race

With the mem fix in place, run 27452976271 cleared the OOM but hit a new failure: both nodes of the TP8-2n job called dynamo fetch_model within 200 ms of each other (191 @ :23.637, 193 @ :23.833). Node 191 took the per-blob .lock on the shared /mnt/vast/hf-home cache and held it while verifying the 444 GB snapshot; node 193 retried for ~6.4 s and died with 'Lock acquisition failed' (dynamo's Rust hub doesn't wait the way Python hf_hub does).

The launcher already pre-stages and verifies the snapshot offline before submit, so the workers never need to fetch. Setting HF_HUB_OFFLINE=1 in every worker env block makes dynamo serve cache-only and skip the download lock entirely, so co-fetching workers no longer collide. Applied to all agg + disagg (prefill/decode) env blocks across the 11 recipes.
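
The race can be reproduced in miniature. The sketch below is a toy model only: fetch_model here is a hypothetical stand-in (not dynamo's implementation), a threading.Lock plays the role of the per-blob .lock, and the timings are scaled down so that the holder outlasts the second worker's retry budget.

```python
import threading
import time


class LockAcquisitionFailed(RuntimeError):
    pass


def fetch_model(lock, hold_s, retry_budget_s, offline):
    """Hypothetical stand-in for a worker's model fetch step."""
    if offline:
        # Cache-only mode: the download lock is never touched.
        return "cache-only"
    if not lock.acquire(timeout=retry_budget_s):
        raise LockAcquisitionFailed("Lock acquisition failed")
    try:
        time.sleep(hold_s)  # stands in for verifying the large snapshot
        return "verified"
    finally:
        lock.release()


def run_two_workers(offline):
    lock = threading.Lock()  # plays the role of the shared per-blob .lock
    results = {}

    def worker(name, start_delay_s):
        time.sleep(start_delay_s)
        try:
            results[name] = fetch_model(
                lock, hold_s=0.5, retry_budget_s=0.1, offline=offline
            )
        except LockAcquisitionFailed:
            results[name] = "failed"

    threads = [
        threading.Thread(target=worker, args=("node-191", 0.0)),
        threading.Thread(target=worker, args=("node-193", 0.05)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_two_workers(offline=False))  # second worker exhausts its retry budget
print(run_two_workers(offline=True))   # neither worker contends for the lock
```

The second call shows why the offline setting removes the failure mode rather than narrowing it: no worker ever reaches the lock.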

github-actions · 2026-06-13T02:07:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453434847
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27453434847

github-actions · 2026-06-13T02:29:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453693856
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27453693856

The previous pin 062a5de9 (set by #1571 "chore: agentx v0.3") was the cjq/agentx-v0.3 tip on 2026-06-02, but that branch was later rebased/force-pushed (now at ff2b646c), which orphaned 062a5de9; GitHub has since garbage-collected it. It is now unfetchable ("upload-pack: not our ref") and absent from every CI runner cache, so actions/checkout fails on any cold runner with "Unable to find current revision in submodule path utils/aiperf" (e.g. the newly-added gb300-cw runner-4, run 27453693856). Re-pin to the current tip of cjq/agentx-v0.3, the branch that .gitmodules already declares; that tip is live/fetchable and contains the prior aiperf history as an ancestor. This makes the pin and the declared branch consistent again.

github-actions · 2026-06-13T05:05:39Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27457134583
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27457134583

Replace the aggregated M3 GB300 topologies with disaggregated-only, and enable NixlConnector KV transfer over multi-node NVLink on every disagg recipe.

On gb300-cw the cross-node prefill->decode KV handoff was silently falling back to RDMA/TCP (~268 MB/s, ~1400 tiny descriptors for M3 MSA cache), which set the disagg ceiling. Setting UCX_CUDA_IPC_ENABLE_MNNVL=y plus --enable-cumem-allocator (VMM-registers KV so NIXL uses cuda_ipc across the NVL fabric) lifts it to ~1.4-1.7 GB/s and gives +17% / +23% / +49% out tok/s/gpu at conc 64 / 128 / 256 (jobs 7490 base vs 7493 MNNVL, 1P1D TP4EP4). This is a GB300-only win: B300 8-GPU IB islands cannot move KV over multi-node NVLink.

Sweep (1k1k), all MNNVL:
- 1P1D TP4+EP4 collocated 1n (8 GPU), conc 8-256 - low/mid latency
- 1P1D TP4+EP4 split 2n (8 GPU), conc 64-512 - mid throughput
- 1P + DP16+EP wide decode 5n (20 GPU), conc 512-2048 - max throughput (decode keeps scaling on NVL where 1P1D saturates: ~1213 vs ~810 out tok/s/gpu @ conc 1024)

Removes all agg-gb300 recipes (1k1k + 8k1k); applies MNNVL to the 8k1k disagg recipe too for consistency.
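
For scale, the quoted figures can be turned into ratios. A small sketch, assuming decimal units (1 GB/s = 1000 MB/s) and treating the approximate numbers above as exact inputs; the helper name is illustrative:

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the new figure to the old one."""
    return after / before


rdma_mb_s = 268.0                # RDMA/TCP fallback bandwidth
mnnvl_mb_s = (1400.0, 1700.0)    # MNNVL range; ASSUMES 1 GB/s = 1000 MB/s

low, high = (speedup(rdma_mb_s, x) for x in mnnvl_mb_s)
print(f"KV transfer bandwidth: {low:.1f}x to {high:.1f}x")

# Wide decode vs 1P1D, out tok/s/gpu at conc 1024.
wide_vs_1p1d = speedup(810.0, 1213.0)
print(f"wide decode vs 1P1D: {wide_vs_1p1d:.2f}x")
```

Under those assumptions the KV path is roughly 5.2x to 6.3x faster, and the wide-decode shape delivers about 1.5x the per-GPU output of 1P1D at conc 1024.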

github-actions · 2026-06-13T21:15:20Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27479316691
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27479316691

The collocated-1n topology (disagg-gb300-1p1d-tp4ep4-1n) declared gpus_per_node: 8, but gb300-cw nodes have 4 GPUs — sbatch rejects it with "Requested node configuration is not available" even on a fully idle cluster (confirmed: fails standalone with 28 nodes free; the split-2n and wide-decode at gpus_per_node 4 schedule fine). It was an 8-GPU-node template artifact that never reached sbatch before. Remove it (1k1k + 8k1k) and let the split-2n cover the low-latency end (conc extended down to 8). Add the 8k1k (isl 8192) scenario mirroring 1k1k with the two valid disagg shapes (split-2n + wide DP16 decode), MNNVL KV transfer on both, seq params retuned for long context (max-model-len 9472) and lower concurrency.
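
A pre-submit check along these lines could catch this class of error before sbatch does. This is only a sketch: the dictionary layout and the two short recipe names are hypothetical, and the real recipes are YAML files read by srtctl.

```python
# GPUs physically present per node; 4 for gb300-cw per the sbatch rejection above.
NODE_GPUS = {"gb300-cw": 4}


def unschedulable(recipes, cluster, node_gpus=NODE_GPUS):
    """Return names of recipes that request more GPUs per node than a node has."""
    limit = node_gpus[cluster]
    return [name for name, r in recipes.items() if r["gpus_per_node"] > limit]


recipes = {
    "disagg-gb300-1p1d-tp4ep4-1n": {"gpus_per_node": 8},
    "split-2n": {"gpus_per_node": 4},     # hypothetical short name
    "wide-decode": {"gpus_per_node": 4},  # hypothetical short name
}

print(unschedulable(recipes, "gb300-cw"))
```

Only the collocated-1n recipe is flagged, which matches what the scheduler reported.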

github-actions · 2026-06-13T21:49:21Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27479506192
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27479506192

github-actions · 2026-06-13T23:10:53Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27480086800
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27480086800

…sagg config

Adds a 17-node (full-rack) disagg topology to the M3 GB300 sweep (1k1k + 8k1k) from on-cluster tuning (gb300-cw):
- PREFILL is the binding bottleneck, not decode width or KV transfer: a single prefill worker left ~3967 reqs queued and starved 64 decode GPUs. Balancing to 5 prefill : 12 decode (TP4) cleared the backlog and lifted throughput +57% (535 -> 843 out tok/s/gpu @ conc 2048).
- TP-only decode (ep1, no expert parallelism) per the Qwen3.5-397B-A17B recipes (closest M3 analog); M3 wide-EP/DP-attention all-to-all was slower and DP32 < DP16 per-GPU.
- Kept the existing 1p1d (low/mid latency) and dep16dec (wide-decode) topologies so CI measures the full Pareto rather than replacing them.

NixlConnector KV transfer stays on multi-node NVLink (MNNVL + cumem); note KV transfer was verified NOT to bottleneck throughput (doubling its bandwidth via num_threads changed end-to-end tok/s/gpu by ~0). Recipe YAMLs line up 1:1 with the nvidia-master.yaml CONFIG_FILE references.
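
The node and GPU counts behind "17-node" can be checked with a little bookkeeping. A sketch, assuming every worker is TP4 and each gb300-cw node holds 4 GPUs (so one worker fills one node); the Topology class and uplift_pct helper are illustrative names, not part of the recipes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    prefill_workers: int
    decode_workers: int
    tp: int = 4             # GPUs per worker (TP4), assumed for all workers
    gpus_per_node: int = 4  # gb300-cw node size

    @property
    def gpus(self) -> int:
        return (self.prefill_workers + self.decode_workers) * self.tp

    @property
    def nodes(self) -> int:
        return self.gpus // self.gpus_per_node


def uplift_pct(before: float, after: float) -> float:
    """Relative throughput change in percent."""
    return (after - before) / before * 100


rack = Topology(prefill_workers=5, decode_workers=12)
print(rack.nodes, rack.gpus)               # 17 nodes, 68 GPUs
print(f"{uplift_pct(535, 843):.1f}%")      # the 535 -> 843 lift quoted above
```

The computed lift is about 57.6%, in line with the +57% reported in the commit message.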

github-actions · 2026-06-14T01:53:59Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27485224567
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27485224567

…-sweep

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 88e99ce.

cursor · 2026-06-14T02:47:57Z

+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "64x128x256x512"


Recipe omits low concurrencies

Medium Severity

The 1k1k 1P+1D sweep declares conc-list values 8, 16, and 32 in nvidia-master.yaml, but the recipe’s benchmark.concurrencies only runs 64 through 512. Multinode jobs use the recipe list via srtctl, not matrix CONC_LIST, so those low-concurrency points never execute.

Additional Locations (1)

.github/configs/nvidia-master.yaml#L11800-L11801
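
The mismatch is easy to detect mechanically. A minimal sketch, assuming the matrix conc-list is the power-of-two range 8 through 512 (only 8, 16, and 32 are confirmed by the finding above) and that the recipe string is "x"-separated as in the quoted diff; both helper names are hypothetical:

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Split an 'x'-separated recipe concurrency string into integers."""
    return [int(tok) for tok in spec.split("x")]


def missing_points(conc_list: list[int], recipe_spec: str) -> list[int]:
    """Matrix concurrencies that the recipe's own list would never run."""
    covered = set(parse_concurrencies(recipe_spec))
    return [c for c in conc_list if c not in covered]


# ASSUMPTION: full matrix list; only 8, 16, 32 are named in the review finding.
matrix_conc_list = [8, 16, 32, 64, 128, 256, 512]
recipe_concurrencies = "64x128x256x512"  # from the quoted recipe diff

print(missing_points(matrix_conc_list, recipe_concurrencies))  # [8, 16, 32]
```

Because multinode jobs take their list from the recipe, any non-empty result means those points are declared but never executed.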

github-actions · 2026-06-14T05:02:25Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27486293814
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27486293814

… for 8k1k

DSR1 GB300 patterns show wide-EP decode hurts M3's MoE all-to-all; independent TP4 decode workers are strictly better. Also, 8k1k is prefill-bound (616-req backlog at 5P:12D); rebalance to 10P:7D per DSR1/DSV4's prefill-heavy long-context ratios.

Changes:
- Replace dep16dec (EP16 single decode) with 1P+4D (4x TP4 ep1 decode) for both 1k1k and 8k1k, same 5 nodes
- Add 10P+7D TP4 ep1 (17 nodes) for 8k1k max throughput
- Tighten concurrency ranges: 1P1D [4-32], 1P4D [64-512], 5P12D/10P7D [1024+]

github-actions · 2026-06-14T05:38:37Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27489662677
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27489662677

github-actions · 2026-06-14T08:44:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27489709722
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27489709722

… (MTP) MI300X recipe (#1749)

* minimaxm3-fp8-mi300x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 MI300X recipe

Adds the spec-decoding=mtp sibling of minimaxm3-fp8-mi300x-vllm, based on the MI300X non-MTP recipe + the MI355X MTP recipe. Keeps the MI300X serve shape (BF16 KV cache, because gfx942 lacks calibrated ROCm FP8 attention scales, plus --no-enable-prefix-caching, TRITON_ATTN, --enforce-eager, minimax_m3 parsers) and adds the Inferact/MiniMax-M3-EAGLE3 draft via --speculative-config (method eagle3, 3 spec tokens) + chat-template prompts.

Carries the same in-place EAGLE3 patch as the MI355X MTP recipe: the shipped ROCm image's AMD MiniMax-M3 model lacks SupportsEagle3, so the recipe patches the installed amd/model.py before serving (functionstackx/vllm#1, upstream vllm-project/vllm#45546; validated green on MI355X). Idempotent; hard-fails on base drift.

TP8-only search space (gfx942 192 GB is memory-tight, like H100), TP8 latency rows started at conc 1, matching the H100/MI355X MTP recipes.

Also adds SPEC_SUFFIX to launch_mi300x-amds.sh so spec-decoding=mtp routes to the _mtp script (the launcher hardcoded _mi300x.sh).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf-changelog: fill in PR link for minimaxm3-fp8-mi300x-vllm-mtp (#1749)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* feat: add MiniMax M3 MI300X day-zero benchmark
* chore: link MiniMax M3 MI300X changelog
* fix: mount ROCm devices on MI300X
* fix: disable prefix caching for MI300X MiniMax M3
* fix: use bf16 kv cache for MI300X MiniMax M3
* perf: enable MI300X MiniMax M3 CUDA graphs
* chore: link MI300X CUDA graph changelog

…p --enforce-eager, VLLM_USE_BREAKABLE_CUDAGRAPH=0) (#1756)

* minimaxm3-fp8-mi300x-vllm-mtp: run with CUDA graphs (drop --enforce-eager)

Remove --enforce-eager from the MI300X EAGLE3 MTP recipe and set VLLM_USE_BREAKABLE_CUDAGRAPH=0, matching the non-MTP MI300X recipe (#1750). Avoids the M3-decode breakable-cudagraph path that previously forced eager execution. Re-sweeps minimaxm3-fp8-mi300x-vllm-mtp.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf-changelog: fill in PR link for minimaxm3-fp8-mi300x-vllm-mtp cudagraphs

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

Data from run 27489709722 showed:
- 1P4D (20 GPU) strictly dominated by 1P1D (8 GPU): 320 vs 974 out/s/gpu @ conc 128 (1k1k). Single prefill can't feed 4 decode workers; the 1P:4D ratio is too decode-heavy.
- 8k1k 5P12D (68 GPU) dominated by 10P7D: 567 vs 874 out/s/gpu @ conc 1024. Prefill-heavy ratio is correct for long context.

Changes:
- Remove 1P4D recipes (both 1k1k and 8k1k)
- Remove 8k1k 5P12D recipe (dominated by 10P7D)
- Restore 1P1D to full concurrency range [8-512] 1k1k, [8-256] 8k1k (was truncated to [4-32] to avoid 1P4D overlap)

Final GB300 configs: 1P1D (latency-to-mid) + rack-saturating (max tput)
1k1k: 1P1D [8-512] + 5P12D [2048-8192]
8k1k: 1P1D [8-256] + 10P7D [1024-4096]
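
One way to state "dominated" precisely is a pairwise check at a fixed scenario and concurrency. The sketch below uses one possible definition (no more GPUs and no less per-GPU throughput, with at least one strictly better); that definition and the 68-GPU figure for 10P7D (17 TP4 workers x 4 GPUs) are assumptions, and this is not the sweep tooling's actual logic.

```python
def dominated(points):
    """points: {config: (gpus, out_tok_s_per_gpu)} at one scenario + concurrency.

    A config is dominated if some other config uses no more GPUs and delivers
    at least as much per-GPU throughput, being strictly better on one of the two.
    """
    losers = []
    for name, (gpus, tput) in points.items():
        for other, (o_gpus, o_tput) in points.items():
            if other == name:
                continue
            if o_gpus <= gpus and o_tput >= tput and (o_gpus < gpus or o_tput > tput):
                losers.append(name)
                break
    return losers


# 1k1k @ conc 128
print(dominated({"1P4D": (20, 320), "1P1D": (8, 974)}))
# 8k1k @ conc 1024; ASSUMES 10P7D also spans 68 GPUs (17 TP4 workers)
print(dominated({"5P12D": (68, 567), "10P7D": (68, 874)}))
```

Both removed shapes come out as dominated under this definition, which is consistent with dropping them from the sweep.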

Oseltamivir · 2026-06-14T08:56:26Z

/run_sweep

github-actions · 2026-06-14T10:42:02Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27493886226
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27493886226

Oseltamivir requested a review from a team June 13, 2026 01:06

Oseltamivir requested review from jgangani and kedarpotdar-nv as code owners June 13, 2026 01:06

github-project-automation Bot added this to InferenceMAX Board Jun 13, 2026

chore: update perf-changelog pr-link to #1735

e3fa89f